{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0d352545-a8c6-45ad-8359-c9b6edd2b7d2",
   "metadata": {},
   "source": [
    "# Vector-Search with SuperDuperDB\n",
    "\n",
    "## Introduction\n",
    "This guide shows how to use SuperDuperDB for vector search, a powerful technique to find similar documents. We'll cover the setup and demonstrate searching a dataset of documents.\n",
    "Vector search with SuperDuperDB is a useful tool in various situations:\n",
    "\n",
    "1. **Efficient Content Search:** Forget old full-text search. Now, use vector search to quickly find what you need in your content.\n",
    "\n",
    "2. **Visual Product Discovery:** Users discover similar products by uploading images, making shopping easier.\n",
    "\n",
    "3. **Smart Recommendations:** Get recommendations based on both visual and text features, making user experiences better.\n",
    "\n",
    "4. **Content Analysis:** Identify similar content and images for thorough fact-checking and reporting.\n",
    "\n",
    "5. **RAG Chatbots:** Essential for RAG chatbots in language models, making them more effective.\n",
    "\n",
    "These examples show how vector search makes tasks more efficient and improves user experiences in different areas.\n",
    "\n",
    "One of the remarkable features of SuperDuperDB is its ability to pull data from any type of database, vectorize it, and perform vector searches. Here's an example."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f283b5675bea4619",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Prerequisites\n",
    "\n",
    "Before diving into the implementation, ensure that you have the necessary libraries installed by running the following commands:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1f9e69e-75f4-42f9-a48d-b1f68f02646d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install superduperdb\n",
    "!pip install ipython openai==1.1.2 numpy==1.24.4"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f79d7ef8-46eb-4210-8d96-a09648314e37",
   "metadata": {},
   "source": [
    "Additionally, ensure that you have set your openai API key as an environment variable. You can uncomment the following code and add your API key:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1c8e68c-045f-44b8-bfbf-4c9dff5cf30c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "#os.environ['OPENAI_API_KEY'] = 'sk-...'\n",
    "\n",
    "if 'OPENAI_API_KEY' not in os.environ:\n",
    "    raise Exception('You need to set an OpenAI key as environment variable: \"export OPEN_API_KEY=sk-...\"')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4db1c2b4-e0b3-420f-ba1c-bd49655bff2b",
   "metadata": {},
   "source": [
    "## Connect to datastore \n",
    "\n",
    "First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. You can configure the `MongoDB_URI` based on your specific setup. \n",
    "Here are some examples of MongoDB URIs:\n",
    "\n",
    "* For testing (default connection): `mongomock://test`\n",
    "* Local MongoDB instance: `mongodb://localhost:27017`\n",
    "* MongoDB with authentication: `mongodb://superduper:superduper@mongodb:27017/documents`\n",
    "* MongoDB Atlas: `mongodb+srv://<username>:<password>@<atlas_cluster>/<database>`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e097557-7c50-4442-9e38-1df8a9d8f211",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb import superduper\n",
    "from superduperdb.backends.mongodb import Collection\n",
    "import os\n",
    "\n",
    "mongodb_uri = os.getenv(\"MONGODB_URI\",\"mongomock://test\")\n",
    "\n",
    "# SuperDuperDB, now handles your MongoDB database\n",
    "# It just super dupers your database by initializing a SuperDuperDB datalayer instance with a MongoDB backend and filesystem-based artifact store\n",
    "db = superduper(mongodb_uri, artifact_store='filesystem://./data/')\n",
    "\n",
    "# CReference collection named 'documents'\n",
    "doc_collection = Collection('documents')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6bb5ee2b-f0bb-4660-961d-fdf98833f33d",
   "metadata": {},
   "source": [
    "## Load Dataset \n",
    "\n",
    "We have prepared a dataset, which is the inline documentation of the pymongo API. Let's load this dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "049b4122-b2c9-4ca5-be3c-df788912ce34",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download the 'pymongo.json' file using curl\n",
    "!curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/pymongo.json\n",
    "\n",
    "# Import the json module to work with JSON data\n",
    "import json\n",
    "\n",
    "# Load the content of the 'pymongo.json' file into the 'data' variable\n",
    "with open('pymongo.json') as f:\n",
    "    data = json.load(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "420ef3662c07d91e",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "As usual, we insert the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "468ec3dc-fa1f-4c23-b569-456b8900b72c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb import Document\n",
    "\n",
    "# Insert multiple documents into the 'documents' collection\n",
    "db.execute(doc_collection.insert_many([Document(r) for r in data]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aba4aa66-2aec-4986-b263-c510788bf478",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Execute a query to find one document in the 'documents' collection\n",
    "result = db.execute(Collection('documents').find_one())\n",
    "\n",
    "# Print the result\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7f78f2d-86c0-463f-8eb1-630cd65d48ef",
   "metadata": {},
   "source": [
    "## Create Vectors\n",
    "\n",
    "In the remainder of the notebook, you can choose between using the `openai` or `sentence_transformers` libraries to perform vector search. After instantiating the model wrappers, the rest of the notebook remains identical.\n",
    "\n",
    "For OpenAI vectors:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0a30873-b7fc-4ec5-ace9-f3d4ca01bab2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the OpenAIEmbedding model from SuperDuperDB\n",
    "from superduperdb.ext.openai.model import OpenAIEmbedding\n",
    "\n",
    "# Initialize an instance of the OpenAIEmbedding model with the 'text-embedding-ada-002' model\n",
    "model = OpenAIEmbedding(model='text-embedding-ada-002')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e8d1d264dd7ba1b",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "For Sentence-Transformers vectors, uncomment the following section:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a14c5c5f-c770-4a94-884c-3705f1d0a627",
   "metadata": {},
   "outputs": [],
   "source": [
    "#import sentence_transformers\n",
    "#from superduperdb import Model, vector\n",
    "\n",
    "#model = Model(\n",
    "#    identifier='all-MiniLM-L6-v2', \n",
    "#    object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),\n",
    "#    encoder=vector(shape=(384,)),\n",
    "#    predict_method='encode', # Specify the prediction method\n",
    "#    postprocess=lambda x: x.tolist(),  # Define postprocessing function\n",
    "#    batch_predict=True, # Generate predictions for a set of observations all at once \n",
    "#)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b546308d-45f2-4605-8778-7aca46fe3c7c",
   "metadata": {},
   "source": [
    "## Index Vectors\n",
    "\n",
    "Now we can configure the Atlas vector-search index. This command saves and sets up a model to `listen` to a particular subfield (or the whole document) for new text, converts it on the fly to vectors, and then indexes these vectors using Atlas vector-search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46ce7a59-cdd2-46e5-a218-77ef73df7a95",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb import Listener, VectorIndex\n",
    "\n",
    "# Add a VectorIndex to the SuperDuperDB instance\n",
    "db.add(\n",
    "    VectorIndex(\n",
    "        # Use a dynamic identifier based on the model's identifier\n",
    "        identifier=f'pymongo-docs-{model.identifier}',\n",
    "        \n",
    "        # Specify an indexing listener with MongoDB collection and model\n",
    "        indexing_listener=Listener(\n",
    "            select=doc_collection.find(),  # MongoDB collection query\n",
    "            key='value',  # Key for the documents\n",
    "            model=model,  # Specify the model for processing\n",
    "            predict_kwargs={'max_chunk_size': 1000},  # Additional prediction arguments\n",
    "        ),\n",
    "    )\n",
    ")\n",
    "\n",
    "# Display the vector indexes in the SuperDuperDB instance\n",
    "db.show('vector_index')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41843d4f",
   "metadata": {},
   "source": [
    "## Perform Vector Search\n",
    "\n",
    "Now that the index is set up, we can use it in a query. SuperDuperDB provides some syntactic sugar for the `aggregate` search pipelines, which can be helpful. It also handles all the conversion of inputs to vectors under the hood."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0184fb1-10ae-4488-9e93-c56b5fcd9ac2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import necessary classes\n",
    "from superduperdb import Document\n",
    "from IPython.display import *\n",
    "\n",
    "# Define the search parameters\n",
    "search_term = 'Query the database'\n",
    "num_results = 5\n",
    "\n",
    "# Execute the query\n",
    "result = db.execute(\n",
    "    doc_collection\n",
    "        .like(Document({'value': search_term}), vector_index=f'pymongo-docs-{model.identifier}', n=num_results)\n",
    "        .find()\n",
    ")\n",
    "\n",
    "# Display a horizontal line\n",
    "display(Markdown('---'))\n",
    "\n",
    "# Iterate through the query results and display them\n",
    "for r in sorted(result, key=lambda r: -r['score']):\n",
    "    # Display the document's parent and res values in a formatted way\n",
    "    display(Markdown(f'### `{r[\"parent\"] + \".\" if r[\"parent\"] else \"\"}{r[\"res\"]}`'))\n",
    "    \n",
    "    # Display the value of the document\n",
    "    display(Markdown(r['value']))\n",
    "    \n",
    "    # Display a horizontal line\n",
    "    display(Markdown('---'))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
